
Add bulk-sparse native vector scoring for searchable snapshots via DirectAccessInput #144557

Merged
ChrisHegarty merged 33 commits into elastic:main from ChrisHegarty:withByteBufferSlices
Mar 25, 2026
Conversation

@ChrisHegarty (Contributor) commented Mar 19, 2026

This PR builds on the zero-copy DirectAccessInput infrastructure introduced in #141718 to extend native SIMD bulk vector scoring to searchable snapshot (SNAP) data. Previously, native bulk scoring was limited to memory-mapped files (MemorySegmentAccessInput); SNAP inputs fell back to one-at-a-time Java scoring.

During HNSW graph traversal, the search algorithm scores a batch of candidate neighbor vectors against the query vector in each step. With this change, those batches can now be scored in a single native SIMD call even when the underlying data lives in the shared blob cache rather than a memory-mapped file. The new DirectAccessInput.withByteBufferSlices API provides zero-copy access to the cached regions' direct byte buffers, allowing native memory addresses to be extracted and passed directly to the bulk-gather scoring functions without any heap copying. When any vector in a bulk batch crosses a cache region boundary, the entire batch falls back to one-at-a-time scoring. In practice this should be rare: for typical configurations the per-batch fallback probability is well under 1%.
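The boundary-fallback decision and the "well under 1%" claim can be sketched as follows. This is an illustrative model, not the actual Elasticsearch code: the names `crossesRegionBoundary`, `canBulkScore`, and `fallbackProbability` are hypothetical, and the 16 MiB region size is an assumed typical configuration.

```java
// Hypothetical sketch: a batch goes through the native bulk path only if every
// candidate vector lies entirely within a single shared-blob-cache region.
public class BulkFallback {

    /** True if a vector of vectorBytes starting at fileOffset spans two cache regions. */
    static boolean crossesRegionBoundary(long fileOffset, int vectorBytes, long regionSize) {
        long startRegion = fileOffset / regionSize;
        long endRegion = (fileOffset + vectorBytes - 1) / regionSize;
        return startRegion != endRegion;
    }

    /** True if the whole batch can be scored in one native bulk-gather call. */
    static boolean canBulkScore(long[] offsets, int count, int vectorBytes, long regionSize) {
        for (int i = 0; i < count; i++) {
            if (crossesRegionBoundary(offsets[i], vectorBytes, regionSize)) {
                return false; // one straddling vector forces one-at-a-time scoring
            }
        }
        return true;
    }

    /** Rough per-batch fallback probability for uniformly placed vectors. */
    static double fallbackProbability(int count, int vectorBytes, long regionSize) {
        double perVector = (double) (vectorBytes - 1) / regionSize;
        return 1.0 - Math.pow(1.0 - perVector, count);
    }

    public static void main(String[] args) {
        long region = 16L << 20; // 16 MiB region, an assumed typical size
        if (crossesRegionBoundary(0, 1024, region)) throw new AssertionError();
        if (!crossesRegionBoundary(region - 100, 1024, region)) throw new AssertionError();
        // e.g. a batch of 32 candidates of 1024 bytes each in 16 MiB regions
        double p = fallbackProbability(32, 1024, region);
        if (p >= 0.01) throw new AssertionError();
        System.out.println("per-batch fallback probability ~" + p);
    }
}
```

Under these assumptions, a 32-vector batch of 1 KiB vectors in 16 MiB regions has roughly a 0.2% chance of hitting a boundary, consistent with the "well under 1%" figure above.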

Key changes:

  • DirectAccessInput.withByteBufferSlices (libs/core): New bulk multi-region zero-copy access method, complementing the single-region withByteBufferSlice from #141718 (Enable zero-copy SIMD vector scoring on searchable snapshots, frozen tier). Implementations in SharedBlobCacheService.CacheFile, FrozenIndexInput, BlobCacheIndexInput, and StoreMetricsIndexInput handle offset adjustment for sliced inputs and graceful fallback (returning false) when regions cross cache boundaries or are not mmap-backed.
  • BULK_GATHER native operation (libs/simdvec/native): New C++ bulk-gather functions for aarch64 and amd64 that accept an array of native memory addresses (one per vector) instead of requiring contiguous memory. Corresponding BULK_GATHER operation plumbing through VectorSimilarityFunctions and JdkVectorLibrary.
  • IndexInputUtils.withSliceAddresses (libs/simdvec): Utility that resolves file byte offsets to native memory addresses, dispatching through MemorySegmentAccessInput (pointer arithmetic) or DirectAccessInput (withByteBufferSlices). Includes reachabilityFence calls to ensure backing memory remains valid during native calls.
  • ByteVectorScorer and Int7SQVectorScorer (libs/simdvec): Refactored to use withSliceAddresses for bulk scoring, supporting both mmap and SNAP inputs through a unified code path. GatherScorer extracted as a shared top-level interface.
  • Test coverage: New tests across SharedBlobCacheServiceTests, FrozenIndexInputTests, BlobCacheIndexInputTests, StoreMetricsIndexInputTests, IndexInputUtilsTests, and ByteVectorScorerFactoryTests covering bulk access, offset adjustment on sliced inputs, cross-region boundary fallback, eviction scenarios, and the super.bulkScore() fallback path.
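The bulk-gather idea in the list above can be modeled in a few lines of plain Java. This is an illustrative stand-in, not the real C++ or FFI code: per-vector "addresses" are simulated as offsets into one backing array in place of raw native pointers, and the inner loop stands in for the SIMD kernel.

```java
// Illustrative model of BULK_GATHER: score `count` candidate vectors against a
// query, reading each candidate through its own address so candidates need not
// be contiguous in memory.
public class BulkGatherSketch {

    static void bulkDotProduct(byte[] backing, long[] addrs, byte[] query, int count, int[] scores) {
        int dims = query.length;
        for (int i = 0; i < count; i++) {
            int base = (int) addrs[i];  // in the native version this is a raw pointer
            int sum = 0;
            for (int d = 0; d < dims; d++) {
                sum += backing[base + d] * query[d];
            }
            scores[i] = sum;            // the SIMD version vectorises this inner loop
        }
    }

    public static void main(String[] args) {
        byte[] backing = new byte[64];
        for (int i = 0; i < backing.length; i++) backing[i] = (byte) i;
        byte[] query = {1, 2, 3, 4};
        long[] addrs = {0, 12, 40};     // scattered, non-contiguous candidate vectors
        int[] scores = new int[3];
        bulkDotProduct(backing, addrs, query, 3, scores);
        // vector at offset 12 is {12,13,14,15}: 12*1 + 13*2 + 14*3 + 15*4 = 140
        if (scores[1] != 140) throw new AssertionError();
        System.out.println(java.util.Arrays.toString(scores));
    }
}
```

The key contrast with the previous contiguous-memory path is visible in the signature: the scorer receives an address per vector rather than a single base pointer plus stride.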

@ChrisHegarty ChrisHegarty added the :Search Relevance/Vectors (Vector search) and Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch) labels Mar 19, 2026
@elasticsearchmachine (Collaborator)
Hi @ChrisHegarty, I've created a changelog YAML for you.

@elasticsearchmachine elasticsearchmachine added the serverless-linked Added by automation, don't add manually label Mar 19, 2026
elasticsearchmachine and others added 2 commits March 19, 2026 13:00
@ldematte (Contributor) left a comment
I gave it a quick first pass, concentrating especially on the native part and how it interacts (how we pass the dataset). Looks good!

long[] offsets,
int length,
int count,
long[] addrs,
Contributor:
I suppose this is a parameter because we'd likely reuse it, and it's not directly a MemorySegment (of size count * ADDRESS.bytes) because we want to call it from code that does not have the preview things?

ChrisHegarty (Contributor, Author):
yeah. This could be a premature optimisation. Lemme revert it, as it's not clear that it's worth it at this point.

Contributor:
No it's OK I think, just wanted to confirm I understood it correctly

ChrisHegarty added a commit that referenced this pull request Mar 20, 2026
While working on bulk sparse scoring (#144557), I noticed that INT8 and FLOAT32 were missing testBulkIllegalDims coverage that INT7U, INT4, and BBQ already have. Extracting this into a small targeted PR.

Both new tests verify IOOBE for count overflow, negative count, negative dims, and undersized result buffer, matching the existing pattern in JDKVectorLibraryInt7uTests.
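The four failure modes those tests verify can be sketched as a single guard method. This is an illustrative sketch, not the actual simdvec code: `checkBulkArgs` and its parameter names are hypothetical, but each branch mirrors one of the cases listed above (count overflow, negative count, negative dims, undersized result buffer).

```java
// Hypothetical sketch of the argument validation exercised by the new tests:
// every bad input raises IndexOutOfBoundsException before any native call.
public class BulkArgsCheck {

    static void checkBulkArgs(long[] offsets, int count, int dims, float[] results) {
        if (count < 0) throw new IndexOutOfBoundsException("negative count: " + count);
        if (dims < 0) throw new IndexOutOfBoundsException("negative dims: " + dims);
        if (count > offsets.length) throw new IndexOutOfBoundsException("count overflows offsets: " + count);
        if (results.length < count) throw new IndexOutOfBoundsException("undersized result buffer");
    }

    static boolean throwsIOOBE(Runnable r) {
        try {
            r.run();
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        long[] offs = new long[4];
        float[] res = new float[4];
        if (!throwsIOOBE(() -> checkBulkArgs(offs, -1, 8, res))) throw new AssertionError();
        if (!throwsIOOBE(() -> checkBulkArgs(offs, 4, -8, res))) throw new AssertionError();
        if (!throwsIOOBE(() -> checkBulkArgs(offs, 5, 8, res))) throw new AssertionError();
        if (!throwsIOOBE(() -> checkBulkArgs(offs, 4, 8, new float[3]))) throw new AssertionError();
        checkBulkArgs(offs, 4, 8, res); // valid arguments pass silently
        System.out.println("all four failure modes rejected");
    }
}
```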
ChrisHegarty added a commit that referenced this pull request Mar 20, 2026
While working on bulk sparse scoring (#144557), I noticed the existing BULK_OFFSETS tests only use random offsets. Random offsets probabilistically cover duplicates and may happen to produce a sequential pattern, but neither case is guaranteed or verified explicitly, so I added two new tests that make the patterns deterministic and assert specific properties that random offsets do not guarantee.

I added these to INT7U only since the offset dispatch logic is the same array_mapper template across all element types. A bug in offset handling would surface here; other type-specific arithmetic is already covered by the existing per-type random-offset tests.
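The two deterministic patterns described above can be sketched as follows. This is an illustrative sketch, not the actual test code: `gather` stands in for the offset-dispatching scorer, and offsets index a plain Java array rather than a native segment.

```java
// Sketch of the two deterministic offset patterns: sequential offsets must
// match a contiguous layout, and all-duplicate offsets must yield the same
// score in every result slot.
public class OffsetPatternsSketch {

    static int dot(byte[] data, int base, byte[] q) {
        int s = 0;
        for (int d = 0; d < q.length; d++) s += data[base + d] * q[d];
        return s;
    }

    static int[] gather(byte[] data, long[] offsets, byte[] q) {
        int[] out = new int[offsets.length];
        for (int i = 0; i < offsets.length; i++) out[i] = dot(data, (int) offsets[i], q);
        return out;
    }

    public static void main(String[] args) {
        byte[] data = new byte[32];
        for (int i = 0; i < data.length; i++) data[i] = (byte) (i + 1);
        byte[] q = {1, 1, 1, 1};

        // sequential pattern: offsets 0, 4, 8, ... must match contiguous scoring
        int[] seqScores = gather(data, new long[] {0, 4, 8, 12}, q);
        for (int i = 0; i < seqScores.length; i++) {
            if (seqScores[i] != dot(data, 4 * i, q)) throw new AssertionError();
        }

        // duplicate pattern: every slot points at the same vector
        int[] dupScores = gather(data, new long[] {8, 8, 8, 8}, q);
        for (int s : dupScores) {
            if (s != dupScores[0]) throw new AssertionError();
        }
        System.out.println("both patterns verified");
    }
}
```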
ChrisHegarty added a commit that referenced this pull request Mar 20, 2026
…144645)

While working on bulk sparse scoring (#144557), I noticed that ByteVectorScorerFactoryTests only tested per-ordinal score() via the supplier path. This PR adds bulk scoring and query-side scorer coverage.

The test structure is designed so that SNAP directory variants can be added alongside the MMap tests once DirectAccessInput support lands.
@ChrisHegarty ChrisHegarty marked this pull request as ready for review March 23, 2026 10:39
@ChrisHegarty ChrisHegarty requested a review from a team as a code owner March 23, 2026 10:39
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-search-relevance (Team:Search Relevance)

* <ol>
* <li>Array of 8-byte longs containing the native memory address of each vector</li>
* <li>Single vector to score against</li>
* <li>Number of dimensions, or for bbq, the number of index bytes</li>
Member:
this isn't for BBQ (yet?)

ChrisHegarty (Contributor, Author):
I just didn't write the native code for it yet, but given how this is progressing - the native mapper template should be trivial. lemme take a look.

ChrisHegarty (Contributor, Author):
BBQ can use a similar technique, but the code is a bit more involved. Let's do it as a follow up.

Contributor:
Do we need to do this for BBQ/DiskBBQ? I think that in that case data is always contiguous...

@thecoop (Member) left a comment
A few test tweaks, but otherwise vector side looks good

@ChrisHegarty ChrisHegarty changed the title Add bulk-gather native vector scoring for searchable snapshots via DirectAccessInput Add bulk-sparse native vector scoring for searchable snapshots via DirectAccessInput Mar 24, 2026
@ChrisHegarty ChrisHegarty enabled auto-merge (squash) March 25, 2026 17:43
@ChrisHegarty ChrisHegarty disabled auto-merge March 25, 2026 17:43
@ChrisHegarty ChrisHegarty merged commit a33042d into elastic:main Mar 25, 2026
33 of 53 checks passed
ChrisHegarty added a commit that referenced this pull request Mar 26, 2026

While working on bulk sparse scoring (#144557), I noticed that checkBulkOffsets and checkBBQBulkOffsets validated segment sizes but not individual offset values. An out-of-range or negative offset would silently read memory beyond the data segment, risking a crash or silently wrong results.

The solution is to replace the sequential size check with per-offset validation that checks each offset points to a valid vector within the data segment. The O(count) loop should be negligible relative to the O(count * dims) native call, but we've made the checks conditional on asserts to avoid any potential negative cost of this, and asserts should be good enough given our testing.

Note: INT4 skips size=2 (packedLen=1) because checkBulkOffsets computes rowBytes = packedLen * 4 / 8 which truncates to 0 via integer division, making the bounds check trivially pass. This is a pre-existing issue with how INT4 passes packed byte length (not element count) as the length parameter to the generic check formula. We can address this separately, if needed.
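The INT4 edge case described above comes down to one integer-division truncation. The sketch below is illustrative (the helper names are hypothetical, only the `packedLen * 4 / 8` formula is taken from the text): with packedLen = 1 the computed row size is 0, so an offset-plus-rowBytes bounds check passes trivially even at the very end of the data.

```java
// Demonstrates why checkBulkOffsets skips size=2 (packedLen=1) for INT4:
// rowBytes = packedLen * 4 / 8 truncates to 0 under integer division.
public class Int4RowBytes {

    static int rowBytes(int packedLen) {
        return packedLen * 4 / 8; // integer division truncates toward zero
    }

    static boolean inBounds(long offset, int rowBytes, long dataLength) {
        return offset >= 0 && offset + rowBytes <= dataLength;
    }

    public static void main(String[] args) {
        if (rowBytes(1) != 0) throw new AssertionError(); // packedLen=1 -> rowBytes 0
        if (rowBytes(2) != 1) throw new AssertionError();
        // with rowBytes == 0, an offset at the very end of the data still "fits"
        if (!inBounds(100, rowBytes(1), 100)) throw new AssertionError();
        // with a real row size, the same offset is correctly rejected
        if (inBounds(100, rowBytes(2), 100)) throw new AssertionError();
        System.out.println("INT4 truncation makes the bounds check trivially pass");
    }
}
```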
seanzatzdev pushed a commit to seanzatzdev/elasticsearch that referenced this pull request Mar 26, 2026
seanzatzdev pushed a commit to seanzatzdev/elasticsearch that referenced this pull request Mar 27, 2026
mamazzol pushed two commits to mamazzol/elasticsearch that referenced this pull request Mar 30, 2026

Labels

  • >enhancement
  • :Search Relevance/Vectors (Vector search)
  • serverless-linked (Added by automation, don't add manually)
  • Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch)
  • test-arm (Pull Requests that should be tested against arm agents)
  • v9.4.0


5 participants